Project Goal

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants, and and predict the manner in which participants did the exercise. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data Processing

i. Load data

training <- read.csv("pml-training.csv", na.strings = c("", "NA"))
testing <- read.csv("pml-testing.csv", na.strings = c("", "NA"))
# We hided results in the final output because it's too long
head(training,10)
dim(training)
## [1] 19622   160

Our obeservations: 1) the dataset is quite large. It also has a lot of variables. 2) Each sensor captures multiple datapoints. Some variables has a lot of NA values. They might not be useful and we should consider exlcude these variables. 3) The name of variable ends with the name of the particular sensor that is providing the datapoint. This allows us to group variables together and possibily conducts a PCA to reduce dementions.

ii. Remove NA columns

# We hided results in the final output because it's too long
library(dplyr)
col_na_count <- training %>%
  select(everything()) %>%
  summarise_all(funs(sum(is.na(.))))
col_na_count

Here we have count of NA in each column. As we can see, for columns with NA, NA is the dominated value - there are 19216 NAs out of 19622 rows. Let’s remove columns with more than half rows NA.

# We hided results in the final output because it's too long
training_filter <- training %>% 
    select(-which(col_na_count > 0.5 * nrow(training)))
head(training_filter)

Now we are down to 60 variables.

iii. Conduct PCA to reduce dimentions

Among the predictors, we can see they can be grouped in to groups based on their names. Let’s group them and check the correlation among variables. First let’s take a look at Belt.

# We hided results in the final output because it's too long
training_belt <- training_filter %>% select(contains("belt"))
cor(training_belt)
head(training_belt)

As we can see, a lot of variables are highly correlated, such as -0.99. We can use Principle Componenet Analysis to reduce the dementions. Notice the scales of these variables are very different and there are both positive and negative values, we will need to center and scale our data as well.

library(caret)

training_belt_pcamodel <- preProcess(training_belt, method = c("center", "scale", "pca"), thresh = 0.90)
training_belt_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 4 components to capture 90 percent of the variance

As we can see, we only need 4 components do perserve 90% of the variance. This will decrease reduce our data dimentions dramatically.

# apply PCA model
training_belt_pca <- predict(training_belt_pcamodel, newdata = training_belt)
# rename the columne names
names(training_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")

Next we will repeat the process for the rest three group of variable - forearm, arm, and dumbell.

# Create model for forearm
training_forearm <- training_filter %>% select(contains("forearm"))
training_forearm_pcamodel <- preProcess(training_forearm, method = c("center", "scale", "pca"), thresh = 0.90)
training_forearm_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 8 components to capture 90 percent of the variance
# We need 8 variables to perserve 90% variance
training_forearm_pca <- predict(training_forearm_pcamodel, newdata = training_forearm)
names(training_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5","forearm_pca6","forearm_pca7","forearm_pca8")
# Create model for arm
training_arm <- training_filter %>% select(contains("_arm"))
training_arm_pcamodel <- preProcess(training_arm, method = c("center", "scale", "pca"), thresh = 0.90)
training_arm_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 7 components to capture 90 percent of the variance
# We need 7 variables to perserve 90% variance
training_arm_pca <- predict(training_arm_pcamodel, newdata = training_arm)
names(training_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5","arm_pca6","arm_pca7")
 # Create model for dumbbell
training_dumbbell <- training_filter %>% select(contains("_dumbbell"))
training_dumbbell_pcamodel <- preProcess(training_dumbbell, method = c("center", "scale", "pca"), thresh = 0.90)
training_dumbbell_pcamodel
## Created from 19622 samples and 13 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (0)
##   - principal component signal extraction (13)
##   - scaled (13)
## 
## PCA needed 6 components to capture 90 percent of the variance
# We need 6 variables to perserve 90% variance
training_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = training_dumbbell)
names(training_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5","dumbbell_pca6")

Now let’s combine all the variables after PCA.

training_pca <- data.frame(user_name = training[, 2], classe = training$classe, training_belt_pca, training_arm_pca, training_forearm_pca, training_dumbbell_pca )

Nowe we have 33 variables. Let’s do some Exploratory Data Analysis.

Explortory Data Analysis

table(training_pca$user_name, training_pca$classe)
##           
##               A    B    C    D    E
##   adelmo   1165  776  750  515  686
##   carlitos  834  690  493  486  609
##   charles   899  745  539  642  711
##   eurico    865  592  489  582  542
##   jeremy   1177  489  652  522  562
##   pedro     640  505  499  469  497

Each of the 6 participants performed 5 classes of workout.

# plot pc1 of belt, forearm, and dumbbell, color by users
library(plotly)
plot_ly(training_pca, x = ~belt_pca1, y = ~forearm_pca1, z = ~dumbbell_pca1, color = ~user_name)

In this plot, we can see that user is differently a very important factor. No matter which classe is, a user’s movements seem to be in similar range, therefore grouped together.

plot_ly(subset(training_pca, user_name = "adelmo"), x = ~belt_pca1, y = ~forearm_pca1, z = ~dumbbell_pca1, color = ~classe)

In this plot, we only look at Adelmo’s movements. We can see that: 1) each classes has different range of movements. 2) There might be some outliers in the range of movement. For example, one data point has extremely large dumbbell_pca1. However, Because we only plotted 3 vairables, it’s hard to see if these are real outliers, so we decided to keep them.

Select models

First, let’s slice the training data into training and validation sample.

# Define train control for k fold cross validation
# train_control <- trainControl(method="cv", number= 5, savePredictions = TRUE)

set.seed(123)
index <- createDataPartition(training$classe, p = 0.6, list = FALSE)
pca_tra <- training_pca[index,]
pca_val <- training_pca[-index,]

This is a mutiple calssification problem. The models we can use are tree based models and k-nearest neighbours.

Model 1 - Random Forest

#Random Forest
model1 <- train(classe ~., data= pca_tra, method="rf")
pca_val$pred_rf <- predict(model1, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2216    8    3    2    3
##          B   56 1449   12    1    0
##          C    3   32 1322   11    0
##          D    6    0   62 1210    8
##          E    0    0   11    7 1424
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9674, 0.9749)
##     No Information Rate : 0.2907          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9637          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9715   0.9731   0.9376   0.9829   0.9923
## Specificity            0.9971   0.9891   0.9929   0.9885   0.9972
## Pos Pred Value         0.9928   0.9545   0.9664   0.9409   0.9875
## Neg Pred Value         0.9884   0.9937   0.9864   0.9968   0.9983
## Prevalence             0.2907   0.1898   0.1797   0.1569   0.1829
## Detection Rate         0.2824   0.1847   0.1685   0.1542   0.1815
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9843   0.9811   0.9652   0.9857   0.9948

Model 2 - K-nearest neighbours

#KNN Model
ctrl <- trainControl(method="repeatedcv",repeats = 3)
model2 <- train(classe ~., data= pca_tra, method="knn", trControl = ctrl)
pca_val$pred_knn <- predict(model2, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_knn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2182   24   15    9    2
##          B   81 1379   45   11    2
##          C   17   30 1283   29    9
##          D    5    4   68 1205    4
##          E    2   16   15   20 1389
## 
## Overall Statistics
##                                           
##                Accuracy : 0.948           
##                  95% CI : (0.9429, 0.9528)
##     No Information Rate : 0.2915          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9342          
##                                           
##  Mcnemar's Test P-Value : 1.762e-12       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9541   0.9491   0.8997   0.9458   0.9879
## Specificity            0.9910   0.9783   0.9868   0.9877   0.9918
## Pos Pred Value         0.9776   0.9084   0.9379   0.9370   0.9632
## Neg Pred Value         0.9813   0.9883   0.9779   0.9895   0.9973
## Prevalence             0.2915   0.1852   0.1817   0.1624   0.1792
## Detection Rate         0.2781   0.1758   0.1635   0.1536   0.1770
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9725   0.9637   0.9432   0.9668   0.9898

Prediction

Random Forest model has higher accuracy than the KNN model. It has 97% accuracy. We will use this model to predict the testing data.

Remove NA columns from testing data

col_na_count_testing <- testing %>%
  select(everything()) %>%
  summarise_all(funs(sum(is.na(.))))

testing_filter <- testing %>% 
    select(-which(col_na_count_testing > 0.5 * nrow(testing)))

Create PCA for testing data

# Create pca variables with "belt"
testing_belt <- testing_filter %>% select(contains("belt"))
testing_belt_pca <- predict(training_belt_pcamodel, newdata = testing_belt)
names(testing_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")

# Create pca variables with "forearm"
testing_forearm <- testing_filter %>% select(contains("forearm"))
testing_forearm_pca <- predict(training_forearm_pcamodel, newdata = testing_forearm)
names(testing_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5", "forearm_pca6", "forearm_pca7", "forearm_pca8")

# Create pca variables with "_arm"
testing_arm <- testing_filter %>% select(contains("_arm"))
testing_arm_pca <- predict(training_arm_pcamodel, newdata = testing_arm)
names(testing_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5", "arm_pca6", "arm_pca7")

# Create pca variables with "dumbbell"
testing_dumbbell <- testing_filter %>% select(contains("dumbbell"))
testing_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = testing_dumbbell)
names(testing_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5", "dumbbell_pca6")

# Combine groups of categories

testing_pca <- data.frame(user_name = testing[, 2], testing_belt_pca, testing_arm_pca, testing_forearm_pca, testing_dumbbell_pca )

Make prediction using Random Forest

testing_pca$pred <- predict(model1, newdata = testing_pca, type = "raw")
testing_pca$pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E